# Load packages and import data
library(readr)
library(ggplot2)
library(dplyr)
bikes <- read_csv("https://mac-stat.github.io/data/bikeshare.csv")Solutions for 05. Simple linear regression: model evaluation
Notes and in-class exercises
Exercise 1: Is the model correct?
The blue curved trend line shows a clear downward trend around 85 degrees, which contextually makes plenty of sense—extremely hot days would naturally see less riders. Overall the combination of the upward trend and downward trend makes for a curved relationship that is not captured well by a straight line of best fit.
ggplot(bikes, aes(x = temp_feel, y = riders_registered)) +
geom_point() +
geom_smooth(se = FALSE) +
geom_smooth(method = "lm", se = FALSE, color = "red")`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
Exercise 2: Fixing the model
The second plot (showing the model with squared temperature) follows the natural curve in the trend better.
bikes <- bikes %>%
mutate(temp_feel_squared = temp_feel^2)
# Plot the model WITHOUT squared temperature
ggplot(bikes, aes(x = temp_feel, y = riders_registered)) +
geom_point() +
geom_smooth(se = FALSE) +
geom_smooth(method = "lm", se = FALSE, color = "red")`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
# Plot the model WITH squared temperature
ggplot(bikes, aes(x = temp_feel, y = riders_registered)) +
geom_point() +
geom_smooth(se = FALSE) +
geom_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE, color = "red")`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Exercise 3: Residual plots
The first residual plot (from the model with just a straight line trend) shows a lingering trend in the residuals—the blue curve traces the trend in the residuals, and it does not lie flat on the y = 0 line.
On the other hand, the second residual plot (from the model which uses a squared term to allow for curvature) shows very little trend in the residuals—the blue curve is almost flat on the y = 0 line.
# Fit a linear model
bike_mod1 <- lm(riders_registered ~ temp_feel, data = bikes)
# Fit a quadratic model
bike_mod2 <- lm(riders_registered ~ temp_feel + temp_feel_squared, data = bikes)
# Check out the residual plot for bike_mod1 (the incorrect model)
ggplot(bike_mod1, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0) +
geom_smooth(se = FALSE)`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
# Construct the residual plot for bike_mod2 (the good model)
ggplot(bike_mod2, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0) +
geom_smooth(se = FALSE)`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Exercise 4: Another example of an incorrect model
# Import the data
library(fivethirtyeight)
data(bechdel)
# Get only 1997 movies
movies_1997 <- bechdel %>%
filter(year == 1997)
# Construct the model
bechdel_model <- lm(intgross ~ budget, movies_1997)# Scatterplot of earnings and budget with linear and curved trend lines
ggplot(movies_1997, aes(x = budget, y = intgross)) +
geom_point() +
geom_smooth(se = FALSE) +
geom_smooth(method = "lm", se = FALSE, color = "red")`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
# Residual plot for bechdel_model
ggplot(bechdel_model, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0) +
geom_smooth(se = FALSE)`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
From the scatterplot, we can see that there is one movie that is a massive outlier in both budget and earnings, and this outlier is pulling up the trend line that makes the model for “regular” movies that have budgets and earnings in “normal” ranges.
The outlier movie is Titanic:
# One of many ways to filter to find the outlier movie!
movies_1997 %>%
filter(intgross > 2000000000)# A tibble: 1 × 15
year imdb title test clean_test binary budget domgross intgross code
<int> <chr> <chr> <chr> <ord> <chr> <int> <dbl> <dbl> <chr>
1 1997 tt0120338 Titanic ok ok PASS 2e8 6.59e8 2.19e9 1997…
# ℹ 5 more variables: budget_2013 <int>, domgross_2013 <dbl>,
# intgross_2013 <dbl>, period_code <int>, decade_code <int>
Exercise 5: Is the model strong? Developing R-squared intuition
The R-squared metric is a way to quantify the strength of a model. It measures how much variation in the outcome/response variable can be explained by the model.
Where does R-squared come from? Well, it turns out that we can partition the variance of the observed response values into the variability that’s explained by the model (the variance of the predictions) and the variability that’s left unexplained by the model (the variance of the residuals):
\[\text{Var(observed) = Var(predicted) + Var(residuals)}\]
“Good” models have residuals that don’t deviate far from 0. So the smaller the variance in the residuals (thus larger the variance in the predictions), the stronger the model. Take a look at the picture below and write a few sentences addressing the following:
- The first row corresponds to the weaker model. We can tell because the points are much more dispersed from the trend line than in the second row. Recall that the correlation metric measures how closely clustered points are about a straight line of best fit, so we would expect the correlation to be lower for the first row than the second row.
- The variance of the residuals is much lower for the second row—the residuals are all quite small. This indicates a stronger model.
Exercise 6: R-squared Interpretations
summary(bike_mod1)
Call:
lm(formula = riders_registered ~ temp_feel, data = bikes)
Residuals:
Min 1Q Median 3Q Max
-3607.1 -959.2 -153.8 998.2 3304.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -667.916 251.608 -2.655 0.00811 **
temp_feel 57.892 3.306 17.514 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1310 on 729 degrees of freedom
Multiple R-squared: 0.2961, Adjusted R-squared: 0.2952
F-statistic: 306.7 on 1 and 729 DF, p-value: < 2.2e-16
Multiple R-squared: 0.2961
Interpretation: 29.61% of the variation in number of registered riders on any given day can be explained by the variation in temperature (specifically, what temperature it “feels” like it is).
Exercise 7: Further exploring R-squared
In this exercise, we’ll look at data from a synthetic dataset called Anscombe’s quartet. Load the data in as follows, and look at the first few rows:
data(anscombe)
# Look at the first few rows
head(anscombe) x1 x2 x3 x4 y1 y2 y3 y4
1 10 10 10 8 8.04 9.14 7.46 6.58
2 8 8 8 8 6.95 8.14 6.77 5.76
3 13 13 13 8 7.58 8.74 12.74 7.71
4 9 9 9 8 8.81 8.77 7.11 8.84
5 11 11 11 8 8.33 9.26 7.81 8.47
6 14 14 14 8 9.96 8.10 8.84 7.04
All of these models have close to the same intercept, slope, and R-squared!
anscombe_mod1 <- lm(y1 ~ x1, data = anscombe)
anscombe_mod2 <- lm(y2 ~ x2, data = anscombe)
anscombe_mod3 <- lm(y3 ~ x3, data = anscombe)
anscombe_mod4 <- lm(y4 ~ x4, data = anscombe)
summary(anscombe_mod1)
Call:
lm(formula = y1 ~ x1, data = anscombe)
Residuals:
Min 1Q Median 3Q Max
-1.92127 -0.45577 -0.04136 0.70941 1.83882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0001 1.1247 2.667 0.02573 *
x1 0.5001 0.1179 4.241 0.00217 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
summary(anscombe_mod2)
Call:
lm(formula = y2 ~ x2, data = anscombe)
Residuals:
Min 1Q Median 3Q Max
-1.9009 -0.7609 0.1291 0.9491 1.2691
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.001 1.125 2.667 0.02576 *
x2 0.500 0.118 4.239 0.00218 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
summary(anscombe_mod3)
Call:
lm(formula = y3 ~ x3, data = anscombe)
Residuals:
Min 1Q Median 3Q Max
-1.1586 -0.6146 -0.2303 0.1540 3.2411
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0025 1.1245 2.670 0.02562 *
x3 0.4997 0.1179 4.239 0.00218 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
summary(anscombe_mod4)
Call:
lm(formula = y4 ~ x4, data = anscombe)
Residuals:
Min 1Q Median 3Q Max
-1.751 -0.831 0.000 0.809 1.839
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0017 1.1239 2.671 0.02559 *
x4 0.4999 0.1178 4.243 0.00216 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
But when we look at the scatterplots, they all look substantially different, and we would want to approach our modeling differently for each one:
x1andy1: A linear model seems appropriate for this data.x2andy2: The scatterplot is clearly curved—a “linear” regression model with squared terms, for example, would be more appropriate for this data. (We’ll talk more about ways to handle nonlinear relationships soon!)x3andy3: There is a very clear outlier at aboutx3 = 13that we would want to dig into to better understand the context. After that investigation, we might consider removing this outlier and refitting the model.x4andy4: There is clearly something strange going on with most of the cases having anx4value of exactly 8. We would not want to jump straight into modeling. Instead, we should dig deeper to find out more about this data.
ggplot(anscombe, aes(x = x1, y = y1)) +
geom_point() +
geom_smooth(method = "lm", color = "red", se = FALSE)`geom_smooth()` using formula = 'y ~ x'
ggplot(anscombe, aes(x = x2, y = y2)) +
geom_point() +
geom_smooth(method = "lm", color = "red", se = FALSE)`geom_smooth()` using formula = 'y ~ x'
ggplot(anscombe, aes(x = x3, y = y3)) +
geom_point() +
geom_smooth(method = "lm", color = "red", se = FALSE)`geom_smooth()` using formula = 'y ~ x'
ggplot(anscombe, aes(x = x4, y = y4)) +
geom_point() +
geom_smooth(method = "lm", color = "red", se = FALSE)`geom_smooth()` using formula = 'y ~ x'
Exercises 8 - 11
No solutions for these exercises. These require longer discussions, not discrete answers.